5 research outputs found

    Performance and scalability of indexed subgraph query processing methods

    Graph data management systems have become very popular, as graphs are the natural data model for many applications. One of the main problems addressed by these systems is subgraph query processing; i.e., given a query graph, return all graphs that contain the query. The naive method for processing such queries is to perform a subgraph isomorphism test against each graph in the dataset. This does not scale, as subgraph isomorphism is NP-complete. Thus, many indexing methods have been proposed to reduce the number of candidate graphs that have to undergo the subgraph isomorphism test. In this paper, we identify a set of key factors (parameters) that influence the performance of related methods: namely, the number of nodes per graph, the graph density, the number of distinct labels, the number of graphs in the dataset, and the query graph size. We then conduct comprehensive and systematic experiments that analyze the sensitivity of the various methods to the values of the key parameters. Our aims are twofold: first, to derive conclusions about the algorithms’ relative performance, and, second, to stress-test all algorithms, deriving insights as to their scalability, and to highlight how both performance and scalability depend on the above factors. We choose six well-established indexing methods, namely Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode, as representative approaches of the overall design space, including the most recent and best performing methods. We report on their index construction time and index size, and on query processing performance in terms of time and false positive ratio. We employ both real and synthetic datasets. Specifically, four real datasets of different characteristics are used: AIDS, PDBS, PCM, and PPI. In addition, we generate a large number of synthetic graph datasets, enabling us to systematically study the algorithms’ performance and scalability versus the aforementioned key parameters.
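
As a rough illustration of the filter-then-verify idea described in this abstract, the sketch below pairs a brute-force subgraph isomorphism test with a very simple label-count filter and reports a false positive ratio. None of this is taken from the paper: the graph representation, the helper names (`make_graph`, `contains`, `passes_filter`, `filter_then_verify`), the label-count filter, and the definition of the false positive ratio as (candidates - answers) / candidates are all assumptions chosen for brevity. The six indexing methods studied use far more elaborate features.

```python
from collections import Counter
from itertools import permutations

def make_graph(labels, edge_list):
    # Hypothetical representation: (node -> label dict, set of undirected edges).
    return labels, {frozenset(e) for e in edge_list}

def contains(graph, query):
    """Brute-force (non-induced) subgraph isomorphism test; exponential time."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    q_nodes = list(q_labels)
    # Try every injective assignment of query nodes to graph nodes.
    for image in permutations(g_labels, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        labels_ok = all(q_labels[u] == g_labels[mapping[u]] for u in q_nodes)
        edges_ok = all(frozenset(mapping[u] for u in e) in g_edges for e in q_edges)
        if labels_ok and edges_ok:
            return True
    return False

def passes_filter(graph, query):
    """Assumed toy filter: the graph must have at least as many nodes of each
    label as the query. It never discards a true answer, but may keep non-answers."""
    g_counts = Counter(graph[0].values())
    q_counts = Counter(query[0].values())
    return all(g_counts[label] >= n for label, n in q_counts.items())

def filter_then_verify(dataset, query):
    candidates = [i for i, g in enumerate(dataset) if passes_filter(g, query)]
    answers = [i for i in candidates if contains(dataset[i], query)]
    # Assumed definition: fraction of candidates that fail verification.
    fp_ratio = (len(candidates) - len(answers)) / len(candidates) if candidates else 0.0
    return candidates, answers, fp_ratio

dataset = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),  # C-C-O
    make_graph({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]),  # C-O-C
    make_graph({0: "N", 1: "N"}, [(0, 1)]),                  # N-N
]
query = make_graph({0: "C", 1: "C"}, [(0, 1)])               # C-C

print(filter_then_verify(dataset, query))  # ([0, 1], [0], 0.5)
```

Here the filter prunes the N-N graph without any isomorphism test, while the C-O-C graph survives filtering and is rejected only at verification, which is what the false positive ratio measures.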

    Improving the performance and scalability of pattern subgraph queries

    Graphs have great representational power, and can thus efficiently represent complex structures, such as chemical compounds and social networks. A common problem that arises with graphs is the subgraph pattern matching query problem, where, given a graph DB and a query in the form of a graph, the graphs from the DB that contain the query are returned. In some algorithms, all possible occurrences of the query graph in the DB graphs are additionally returned. The subgraph matching problem entails subgraph isomorphism, which is known to be NP-complete. To alleviate the problem, a large number of methods have been proposed over the years, which can be classified in two major categories: (i) the filter-then-verify (FTV) methods and (ii) the subgraph isomorphism (SI) methods. Specifically, the FTV methods rely on a constructed index with the aim of filtering out graphs from the DB that definitely do not contain the query graph. On the remaining set of graphs, which form the so-called candidate set, a subgraph isomorphism algorithm is applied to verify whether the query graph is indeed contained in each DB graph. SI methods aim to optimize their subgraph isomorphism testing process by suggesting different heuristics. With our work, we confirm that both FTV and SI methods suffer from significant performance and scalability limitations, stemming from the NP-complete nature of the subgraph isomorphism problem. Instead of trying to devise new algorithms with better performance than the existing ones, we take a different approach. We suggest a number of solutions to improve their performance and to extend their scalability limits. In more detail, we conduct a comprehensive analysis of the state-of-the-art FTV methods.
    We initially identify a set of key-factor parameters that influence the performance of related methods, namely the number of nodes and density per graph, the number of distinct labels and graphs in the graph DB, and the size of the query. Subsequently, using the aforementioned parameters, we perform a large number of experiments with both real and synthetic datasets in a systematic way, where we report on indexing time and size, query processing time, and filtering power. We analyze the sensitivity of the various FTV methods. Our analysis helps us draw useful conclusions about the algorithms' relative performance. In parallel, we stress-test them and thus recognize different scalability limitations, i.e., points where some algorithms operate while others break. One of the conclusions drawn from our experiments with the FTV methods is that as the graphs in the dataset grow large in the number of nodes and/or density, and as the query size increases, query processing becomes harder. Thus, we additionally bring into play the state-of-the-art SI methods and, along with the top-performing FTV methods as indicated by our aforementioned analysis, we investigate whether all queries of the same size are equally challenging. First, our experiments reveal that all proposed methods suffer from stragglers, i.e., queries with execution times many orders of magnitude worse than those of the majority. Second, through our experiments we have seen that isomorphic queries can have wildly different execution times on the various algorithms. Thus, we propose our own isomorphic query rewritings that can introduce large performance gains. Third, we observe that stragglers are algorithm-specific, i.e., a straggler query on one algorithm can be a typical query on some other algorithm. We incorporate our findings in a novel proposed framework, called the Psi-framework, which runs different isomorphic instances of the original query and/or different algorithms in parallel.
    Such parallel executions of various algorithms have been used for other NP-hard problems and are known as portfolios of algorithms. Our framework introduces large performance gains in the subgraph matching problem, on both FTV and SI methods across all employed datasets, where some combinations of algorithms perform better than others. As in the Psi-framework, some portfolios are more favorable than others. Recently proposed methods tend to dismiss FTV methods entirely and employ SI methods instead, with the claim that the SI methods enjoy shorter query execution times and that managing the index-based FTV methods is too costly. With our work, we investigate this claim. We initially quantify the constructed index of state-of-the-art SI methods and the top-performing FTV method in terms of time and size, and we evaluate the efficiency of the constructed indices in filtering out graphs that do not contain the query. Based on our experiments, on both real and synthetic datasets, SI methods fail to avoid a large number of redundant subgraph isomorphism tests. Additionally, our experiments on the SI methods fail to indicate a single winner. Thus, we propose a hybrid FTV-SI method, as a combination of the filtering achieved by the top-performing FTV method and the verification of various SI methods. This hybrid FTV-SI combination had not been studied before, perhaps surprisingly for the problem at hand. Based on our experiments, such a hybrid combination brings high speedups in the subgraph matching problem. In an attempt to further reduce the underlying indexing costs, we additionally experiment with different sizes of the enumerated features. Our experiments reveal that we can still achieve high-quality filtering, even with smaller features, while the overall query execution time is still significantly improved.
    With our research results, we hope to open up a whole new research trend in which the community will benefit from already existing solutions by combining them appropriately to achieve large performance gains.
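
To make the notion of an isomorphic query rewriting concrete, here is a minimal sketch that renumbers the nodes of a query so that nodes carrying rarer labels receive smaller IDs. The abstract does not specify which rewritings the thesis proposes, so the rarest-label-first rule, the `rewrite_query` helper, and the `label_freq` input are purely illustrative assumptions. The point is only that the rewritten query is isomorphic to the original, so the answer set is unchanged, while a matcher that processes query nodes in ID order would explore the search space in a different order.

```python
def rewrite_query(labels, edges, label_freq):
    """Return an isomorphic copy of the query whose node ids are renumbered
    so that nodes with rarer labels (per label_freq) get smaller ids.

    labels: dict node id -> label; edges: set of frozenset node pairs;
    label_freq: dict label -> occurrence count in the dataset (hypothetical input).
    """
    # Sort by (label frequency, old id); the old id breaks ties deterministically.
    order = sorted(labels, key=lambda u: (label_freq.get(labels[u], 0), u))
    new_id = {old: new for new, old in enumerate(order)}
    new_labels = {new_id[u]: labels[u] for u in labels}
    new_edges = {frozenset(new_id[u] for u in e) for e in edges}
    return new_labels, new_edges

# Query path C - C - N, with N assumed to be much rarer than C in the dataset.
q_labels = {0: "C", 1: "C", 2: "N"}
q_edges = {frozenset((0, 1)), frozenset((1, 2))}
label_freq = {"C": 100, "N": 3}

new_labels, new_edges = rewrite_query(q_labels, q_edges, label_freq)
print(new_labels)  # the N node now has id 0
print(new_edges)
```

Because the rewriting is a bijection on node IDs, label counts and edge counts are preserved, and any graph that contains the original query also contains the rewritten one.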

    Improving the Performance and Scalability of Pattern Subgraph Queries

    The data provided include the datasets used in the PhD thesis titled "Improving the Performance and Scalability of Pattern Subgraph Queries". The thesis contains 7 chapters, of which chapters 4, 5, and 6 are related to the datasets provided. The rest of the chapters serve as introductory and concluding material to the thesis. All datasets follow 2 different file formats: the grapes and igraph formats, as described in the README file. Most of the datasets provided are generated with the synthetic generator GraphGen (http://www.cse.ust.hk/graphgen/). The real datasets were obtained as follows. AIDS, PDBS, PCM, and PPI were retrieved from the authors of Grapes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3805575/). Unfortunately, the link at which they maintained the datasets no longer exists. Human and yeast were retrieved from J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee, “An in-depth comparison of subgraph isomorphism algorithms in graph databases,” PVLDB, vol. 6, no. 2, pp. 133–144, 2012. Wordnet was obtained from http://vlado.fmf.uni-lj.si/pub/networks/data/dic/Wordnet/Wordnet.htm

    Subgraph Querying with Parallel Use of Query Rewritings and Alternative Algorithms

    Subgraph queries are central to graph analytics and graph DBs. We analyze this problem and present key novel discoveries and observations on the nature of the problem which hold across query sizes, datasets, and top-performing algorithms. Firstly, we show that algorithms (for both the decision and matching versions of the problem) suffer from straggler queries, which dominate query workload times. As related research caps query times, not reporting results for queries exceeding the cap, this can lead to erroneous conclusions about the methods' relative performance. Secondly, we study and show the dramatic effect that isomorphic graph queries can have on query times. Thirdly, we show that for each query, isomorphic queries based on proposed query rewritings can introduce large performance benefits. Fourthly, we show that straggler queries are largely algorithm-specific: many queries that are challenging for one algorithm can be executed efficiently by another. Finally, the above discoveries naturally lead to the derivation of a novel framework for subgraph query processing. The central idea is to employ parallelism in a novel way, whereby parallel matching/decision attempts are initiated, each using a query rewriting and/or an alternate algorithm. The framework is shown to be highly beneficial across algorithms and datasets.
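
The central idea of racing several attempts in parallel can be sketched with Python's standard `concurrent.futures` module. This is not the authors' framework: the `race` helper and the attempt names are hypothetical, and the two attempts merely sleep to stand in for a straggler and for a fast rewriting. Note also that Python threads cannot be forcibly stopped, so the losing attempt here simply runs to completion in the background; a real system would need cooperative cancellation or separate processes.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def race(attempts):
    """Start every attempt concurrently; return (name, result) of the first to finish.

    attempts: dict mapping a descriptive name to a zero-argument callable.
    """
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = {pool.submit(fn): name for name, fn in attempts.items()}
    done, _pending = wait(list(futures), return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower attempts
    winner = next(iter(done))
    return futures[winner], winner.result()

def slow_attempt():
    time.sleep(0.5)   # stands in for a straggler query on one algorithm
    return True

def fast_attempt():
    time.sleep(0.01)  # stands in for a rewriting or alternate algorithm that is fast
    return True

attempts = {
    "algoA/original-query": slow_attempt,   # hypothetical attempt names
    "algoA/rewritten-query": fast_attempt,
}
name, answer = race(attempts)
print(name, answer)
```

With this design, the response time of the whole race is that of the fastest attempt, so a query that is a straggler for one algorithm or one rewriting no longer dominates, at the price of extra concurrent work.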